Adaptive pointwise-pairwise learning to rank

ABSTRACT

A method of ranking items for a given entity uses sets of triplets <u, i, j>, each set of triplets including an entity u and a pair of items i and j with a known relative relevance for entity u, to train a learnable scoring function ƒ and to learn optimized values of a first set θ of learnable parameters. The training includes optimizing a loss function depending on θ, on a second set of learnable parameters θ_g, and on a probability of having the item i preferred to the item j by the entity u. The probability defines a continuum between pointwise and pairwise ranking of items through a learnable mixing function depending on θ_g. After training, the trained learnable scoring function ƒ is applied to all input pairs <u′, i′> to rank all items i′ for an entity u′.

PRIORITY INFORMATION

Pursuant to 35 U.S.C. § 119(a), this application claims the benefit of earlier filing date and right of priority to European Patent Application Number EP 20305582, filed on Jun. 3, 2020, the contents of which are hereby incorporated by reference in their entirety.

BACKGROUND

Learning-to-rank or machine-learned ranking is the application of machine learning to the construction of ranking models for information retrieval systems. For this purpose, a ranking model is first trained on the basis of training data, which consists of lists of “items” (e.g., any types of documents or objects) associated to a given “entity,” which may be any one of a context, a user, or a user query, with some known partial order specified between the items in each list.

The items can be ordered through a numerical or ordinal score or through a binary judgment (e.g., “relevant” or “not relevant”) for each item. The ranking model's purpose is to model the relevance relationship (represented by a scoring function) between an item and any one of a context, a user, or a user query, thereby making it possible to rank any set of items for a given context, user, or user query in a way which is “similar” to the rankings known from the training data.

In the rest of the document, for the sake of readability, reference will mostly be made to a “user” instead of an “entity” or any one of a context, a user, or a user query. However, it is to be understood that these terms are used interchangeably throughout the document and that any reference to a user may therefore equally refer to a context or to a (user) query. Further, any reference to a “set of users” refers in fact more generally to a “set of entities” or “set of any one of a context, a user, or a user query,” so that a single set of users may in fact contain different types of entities (contexts, users, and/or user queries).

For the convenience of learning-to-rank algorithms, items and users are usually represented by numerical vectors, which are called embeddings. In more detail, a user can be associated to measurable observations (e.g., its demographics, . . . ) that are called user features (or, equivalently, a user feature vector). In the same way, an item can be associated to measurable observations (e.g., its content, producer, price, creation time, . . . ) that are called item features (or, equivalently, an item feature vector).

However, learning-to-rank machine learning algorithms tend to use “higher semantic level” representations, which are better adapted to the task of obtaining a scoring function that allows the ordering of the items according to their relevance to a given user. These higher level representations are called “embeddings” (and sometimes latent factors, because they are not directly observed and/or measured). In the rest of the document, the term embedding will be used. In particular, $\vec{u}$ will denote the user embedding and $\vec{i}$ will denote the item embedding.

Ranking is particularly useful in information retrieval problems, but may also be used in other fields of computer science as diverse as machine translation, computational biology, recommender systems, software engineering, or robotics.

Learning-to-rank models are classified into three groups of approaches in terms of the input representation and the loss function used to train the models: the pointwise, pairwise, and listwise approaches.

In the pointwise approach, each <user, item> pair in the training data has a relevance grade. The learning-to-rank problem can then be approximated by a classification problem (when relevance grades are binary) or by an ordinal regression problem: given a single <user, item> pair, predict its score.

In the pairwise approach, the learning-to-rank problem is approximated by a pair classification problem: learning a binary classifier that can tell, for a given user, which item is better in a given pair of items.

Finally, in the listwise approach, a ranking performance measure defined at the level of the list is optimized directly.

When designing recommender systems and, especially, “learning-to-rank” strategies, the issue is that of choosing which one of the “pointwise,” “pairwise,” and “listwise” approaches should be adopted, depending on the data distribution (e.g., sparsity level, label noise, input feature distribution, . . . ).

Each approach has its own advantages and drawbacks, in terms of ranking performance, robustness to noise, and computational complexity. To alleviate the weaknesses, researchers in each approach have developed “ad-hoc” improvements, often based on heuristics.

For instance, pointwise approaches can be improved by specific instance-weighting schemes and by using user-level feature normalization (e.g., removing the “user rating bias”). Pairwise approaches, which are very sensitive to label noise and to the choice of the “negative” instances, have been enhanced by specific sampling strategies when feeding the “learning-to-rank” algorithm with training “triplets.”

Even if pointwise and pairwise approaches have their own complementary advantages and drawbacks, most of the conventional frameworks focus on only one approach, trying to mitigate its drawbacks in some ad-hoc way.

Wang, Y., Wang, S., Tang, J., Liu, H., Li, B.: “PPP: joint pointwise and pairwise image label prediction”, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, Jun. 27-30, 2016, pages 6005 to 6013, relates to image label prediction. A joint pointwise and pairwise approach is disclosed according to which the two loss functions corresponding to the pointwise and pairwise approaches are added, and the problem is solved jointly through a gradient descent on the total loss at every iteration.

Lei, Y., Li, W., Lu, Z., Zhao, M.: “Alternating pointwise-pairwise learning for personalized item ranking”, in: Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM '17), ACM, New York, NY, USA, pages 2155 to 2158, relates to personalized item ranking. A joint pointwise and pairwise approach is disclosed according to which the two loss functions corresponding to the pointwise and pairwise approaches are also added, as in Wang et al., but the problem is solved alternately through a gradient descent on the pointwise loss followed by a gradient descent on the pairwise loss.

Thus, it is desirable to utilize the pointwise and pairwise approaches in an adaptive way.

Moreover, it is desirable to provide an adaptive pointwise-pairwise learning-to-rank method that overcomes the above deficiencies.

Furthermore, it is desirable to provide an adaptive pointwise-pairwise learning-to-rank method that introduces a (meta-)learnable adaptive combination of the two approaches, so that the precise balance between pointwise and pairwise contributions depends on a particular triplet instance <user u, item i, item j> taken as input.

Also, it is desirable to provide an adaptive pointwise-pairwise learning-to-rank method that is not mutually exclusive, but combines the pointwise and pairwise approaches for the same task and dataset, and describes a way, learned from the data, to combine the pointwise and pairwise approaches optimally and adaptively.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings are only for purposes of illustrating various embodiments and are not to be construed as limiting, wherein:

FIG. 1 is a functional block diagram illustrating a method of ranking items for a given entity according to an embodiment;

FIG. 2 is a functional block diagram illustrating sub-steps of step 102 of FIG. 1 for training a learnable scoring function on a set of triplets according to an embodiment;

FIG. 3 illustrates an example of an architecture in which the disclosed methods may be performed;

FIG. 4 shows Table 1;

FIG. 5 shows Table 2; and

FIG. 6 shows Table 3.

DETAILED DESCRIPTION

Described herein are systems and methods for ranking items according to their relevance to a user. For purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the described embodiments.

The illustrative embodiments will be described with reference to the drawings, wherein like elements and structures are indicated by like reference numbers.

In a “learning-to-rank” problem, the task is to obtain a scoring function ƒ that allows the ordering of items according to their relevance to a given user (most of the time in decreasing order). In more detail, this function, which must be learnable (i.e., it must depend on a set of learnable parameters, called θ), takes as input a pair <user u, item i> and outputs a relevance score of the item i for the user u.

In order to obtain a scoring function ƒ able to optimally score items according to their relevance for a given user, the scoring function must first be “trained.” The “training phase” of the scoring function aims at optimizing internal, learnable parameters of the scoring function by training it on a set of triplets <user u, item i, item j> for which the expected result is known (i.e., in the present case, the relative relevancies of items i and j for the user u, as explained in more detail below).

For this purpose, the scoring function is applied to the pair <u, i> and outputs a relevance score of the item i for the user u. Similarly, the scoring function is applied to the pair <u, j> and outputs a relevance score of the item j for the user u. The items i and j can then be ranked (for the user u) according to their respective relevance scores.

An adequately chosen loss function can then be applied to the ranking for evaluating the efficiency (or degree of validity) of this ranking, based on the known respective relevancies of items i and j for the user u.

In machine learning, a loss function (also called “cost function” or simply “loss”) is a function that measures the discrepancy between the output of a model and the desired target/reference. For a given user and a given set of items, once a ranking has been obtained by applying the learnable scoring function to each of the items of the set and by ranking them according to their respective relevance scores, the better the obtained ranking is, the smaller will be the result of the loss function.

The aim of the training phase is to modify the internal learnable parameters of the learnable scoring function so that the learnable parameters minimize the result of the loss function. This optimization of the loss function (which must be convergent towards a minimum) is rendered possible by “gradient descent optimization,” where gradients are obtained from the partial derivatives of the loss function with respect to the learnable parameters of the learnable scoring function.

These loss gradients are “back-propagated” to the respective learnable parameters in that the loss gradients are used to modify (or adapt or update) the learnable parameters to perform better at the next iteration of the training phase. At the next iteration, the learnable scoring function will output a probability of having the item i preferred to the item j by the user u computed with the help of the modified learnable parameters. This probability will better reflect the correct ranking of the items, leading to a lower value of the loss function and to a smaller adaptation of the learnable parameters at each iteration of the gradient descent optimization, until the loss function converges towards a minimum and no adaptation of the learnable parameters is necessary anymore. Once the loss function has converged, the learnable scoring function has been successfully trained and can then be used to perform item ranking.

While the explanations given above have been given in the context where the loss function must be minimized, which is the standard case in the field, it is to be understood that, in an alternative embodiment, with a minor mathematical redefinition of the loss function, the training phase of the model may aim at maximizing a corresponding function, which is called an “objective function” (or “gain function” or simply “utility”).

Thus, in order to encompass both embodiments, the following will refer more generally to the “optimization” of the loss function, which may then be minimized or maximized, depending on the specific definition of the loss function when training a given ranking model.

Pointwise ranking directly estimates the relevance of an item (or document) i for a user u, considering the relevance estimation as a classification problem or as an ordinal regression problem (for graded relevancies). Groundtruth relevance labels in the training set (i.e., a label for each given item of the training set stating whether the given item is relevant or not) are given either explicitly by manual relevance assessments, or by user clicks (implicit user feedback).

In the latter case, some propensity weighting strategy should be used to compensate for the different biases (e.g., position or layout biases). Most recommendation models assume that the relevance probability of an item i for user u can be expressed in the following form:

P(i|u) = σ(ƒ(u,i;θ))

where σ is typically the sigmoid function (σ(x) = 1/(1+e^(−x))), and ƒ(u, i; θ) is the learnable scoring function mentioned above, with a set of learnable parameters θ (which may be, in alternative embodiments, a vector or a matrix of parameters), that estimates the relevance score of the item i for the user u. In other words, it estimates the likelihood of the item i to be clicked by the user u.

As mentioned before, the learnable scoring function ƒ and its associated parameters θ can be learned by solving a classification task, which classifies the item i as positive class (y_{u,i} = 1) if it is clicked by the user u (or manually assessed as relevant) or as negative class (y_{u,i} = 0) if it is not clicked by the user u (or manually assessed as irrelevant). The learnable parameters θ of the classification function can be learned by maximizing the likelihood of correct relevance of an item i for a user u:

$\arg\max_{\theta} \prod_{(u,i) \in D,\; y_{u,i}=1} p(i \mid u) \prod_{(u,i) \in D,\; y_{u,i}=0} \left( 1 - p(i \mid u) \right)$

This amounts to minimizing the binary cross-entropy loss $\mathcal{L}_{pointwise}$, which is given by:

$\mathcal{L}_{pointwise}(\theta) = - \sum_{(u,i) \in D} \left( y_{u,i} \log \sigma\left( f(u,i;\theta) \right) + \left( 1 - y_{u,i} \right) \log\left( 1 - \sigma\left( f(u,i;\theta) \right) \right) \right) + \lambda \|\theta\|$

where λ∥θ∥ is a regularization term on θ, λ is the regularization constant, which is a hyper-parameter, and ∥θ∥ denotes a norm of θ.

In one embodiment, ∥θ∥ denotes the squared L2-norm of θ, which has been used in the experimental setup described below. In alternative embodiments, an L1-norm, an L2-norm, or more generally any other Ln-norm, or a squared version of these norms, can also be used, where n is an integer greater than 2.
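As an illustration only, the pointwise loss above can be written compactly on top of an automatic-differentiation library. The following is a minimal sketch in PyTorch, assuming the per-pair scores ƒ(u, i; θ) have already been computed; the function and variable names are hypothetical, not from the source.

```python
import torch
import torch.nn.functional as F

def pointwise_loss(scores, labels, theta, lam=1e-4):
    # scores: f(u, i; theta) for each <user, item> pair in the batch.
    # labels: y_{u,i} in {0, 1} as floats.
    # Binary cross-entropy applied to sigmoid(scores), summed over the batch,
    # plus the squared-L2 regularization term lambda * ||theta||^2.
    bce = F.binary_cross_entropy_with_logits(scores, labels, reduction="sum")
    return bce + lam * theta.pow(2).sum()

theta = torch.randn(10, requires_grad=True)    # stand-in parameter vector
scores = torch.randn(8)                        # stand-in scores
labels = torch.randint(0, 2, (8,)).float()     # stand-in click labels
loss = pointwise_loss(scores, labels, theta)
```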

Pairwise ranking focuses on the relative order between a pair of items (or documents) to achieve a correct ranking of this set of items for a given user. One specific version of pairwise ranking is Bayesian Personalized Ranking (BPR), which optimizes the model parameters during training by maximizing the estimated probability of an item i to be preferred to an item j for a user u. This estimated probability is given by:

p(i>j|u) = p(j<i|u) = σ(ƒ(u,i;θ) − ƒ(u,j;θ))

where ƒ(u, i; θ) is again the learnable scoring function that assigns a relevance score of the item i for the user u and σ is the sigmoid function introduced above. A further variable y_{u,i>j} is defined, with y_{u,i>j} = 1 when user u prefers item i over j, and y_{u,i>j} = 0 otherwise. The estimated probability p(i>j|u) is expected to be high when y_{u,i>j} = 1, and to be low in the opposite case (y_{u,j>i} = 1).

Maximizing the user preference over different items corresponds to maximizing the likelihood of correct ordering for any item pair (i, j) given a user u:

$\arg\max_{\theta} \prod_{(u,i,j) \in D,\; y_{u,i>j}=1} p(i > j \mid u) \prod_{(u,i,j) \in D,\; y_{u,j>i}=1} p(i < j \mid u)$

Maximizing this likelihood amounts to minimizing the following binary cross-entropy loss:

$\mathcal{L}_{pairwise}(\theta) = - \sum_{(u,i,j) \in D} \left( y_{u,i>j} \log \sigma\left( f(u,i;\theta) - f(u,j;\theta) \right) + \left( 1 - y_{u,i>j} \right) \log\left( 1 - \sigma\left( f(u,i;\theta) - f(u,j;\theta) \right) \right) \right) + \lambda \|\theta\|$

The pairwise ranking approach better formulates the ranking problem by using the relative preference (ordering) between two items. In particular, it does not require any inter-user normalization, which might be necessary in pointwise approaches to mitigate the inter-user variance. Pairwise approaches are also less sensitive to class imbalance than pointwise approaches. Despite this, the pairwise ranking approach is more sensitive to noisy labels than the pointwise approach. Implicit feedback naturally brings some noise, since some irrelevant items could be clicked by mistake and some relevant items might not be clicked by the user.
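Analogously, the BPR-style pairwise loss can be sketched as follows, again in PyTorch and with hypothetical names; the only change with respect to the pointwise sketch is that the logit is the score difference ƒ(u, i; θ) − ƒ(u, j; θ).

```python
import torch
import torch.nn.functional as F

def pairwise_bpr_loss(f_i, f_j, y_i_gt_j, theta, lam=1e-4):
    # f_i, f_j: scores f(u, i; theta) and f(u, j; theta) per triplet <u, i, j>.
    # y_i_gt_j: 1.0 if user u prefers item i over item j, else 0.0.
    diff = f_i - f_j                     # logit of p(i > j | u)
    bce = F.binary_cross_entropy_with_logits(diff, y_i_gt_j, reduction="sum")
    return bce + lam * theta.pow(2).sum()
```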

Hence, a learning strategy is proposed so that the model can decide itself, for each triplet <u, i, j>, which one of the pointwise and the pairwise approaches should be adopted. The proposed “meta-learning” strategy considers that there is a continuum between the pointwise and the pairwise approaches: how to position the cursor in this continuum should be learned from the data and should be dependent on the considered triplet. This can be done by utilizing a weighting coefficient γ which softly determines the compromise between pointwise and pairwise ranking. The idea is to modify the estimated probability p(i>j|u) as follows:

p(i>j|u) = σ(ƒ(u,i;θ) − γ·ƒ(u,j;θ))

where γ takes values in the interval [0,1].

In one embodiment, γ is a constant hyperparameter to be tuned/chosen on a validation set (together with other hyperparameters).

In another embodiment, γ is computed as a function of user u and items i and j (possibly including the ranks of these items): γ = g(u, i, j; θ_g), where θ_g is another set of learnable parameters distinct from θ (θ_g may be, in alternative embodiments, a vector or a matrix of parameters).

The learnable function g is also called the mixing function. One possible formulation of g would be a similarity function, which is a real-valued function that quantifies the similarity between two objects, in that it takes large values for similar objects and small values close to zero for very dissimilar objects. Any similarity function fulfilling these requirements may be used.

Consequently, the negative log-likelihood loss function can be formulated as:

$\mathcal{L}_{adaptive}(\theta, \theta_g) = - \sum_{(u,i,j) \in D} \left( y_{u,i>j} \log \sigma\left( f(u,i;\theta) - g(u,i,j;\theta_g) \, f(u,j;\theta) \right) + \left( 1 - y_{u,i>j} \right) \log\left( 1 - \sigma\left( f(u,i;\theta) - g(u,i,j;\theta_g) \, f(u,j;\theta) \right) \right) \right) + \lambda \|\theta\| + \lambda_g \|\theta_g\|$

where λ∥θ∥ is a regularization term on θ and λ_g∥θ_g∥ is a regularization term on θ_g, λ and λ_g are the regularization constants, which are hyper-parameters, and ∥θ∥ and ∥θ_g∥ denote a norm of θ and of θ_g, respectively.

In one embodiment, the norm is the squared L2-norm, which has been used in the experimental setup described below. In alternative embodiments, an L1-norm, an L2-norm, or more generally any other Ln-norm, or a squared version of these norms, can also be used, where n is an integer greater than 2.

$\mathcal{L}_{adaptive}(\theta, \theta_g)$ reduces to $\mathcal{L}_{pointwise}(\theta)$ when γ, i.e., g(u, i, j; θ_g), equals 0, and to $\mathcal{L}_{pairwise}(\theta)$ when γ equals 1.

In particular, depending on the particular triplet <u, i, j>, the corresponding term in the loss function could correspond to (i) learning to classify a positive instance when g(u, i, j; θ_g) is close to 0 and y_{u,i>j} = 1; (ii) learning to classify a negative instance when g(u, i, j; θ_g) is close to 0 and y_{u,j>i} = 1; and (iii) learning to rank i higher than j when g(u, i, j; θ_g) is close to 1 and y_{u,i>j} = 1.

Note that the complete loss function $\mathcal{L}_{adaptive}(\theta, \theta_g)$ considers all “positive” pairs (y_{u,i>j} = 1) as well as all “negative” pairs (y_{u,i>j} = 0). While these two terms could seem redundant at first glance, this is not the case: only considering the first term would lead to a trivial solution where g(u, i, j; θ_g) is 0 everywhere and the problem would reduce to training a binary classifier with only positive examples (any model ƒ(.; θ) that outputs a constant very high value would be perfect). The second term enforces the inclusion of “negative” examples as well, either in a pointwise way or in a pairwise way.
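Putting the two previous sketches together, the adaptive loss can be written as follows (a minimal PyTorch sketch with hypothetical names, assuming the mixing coefficients g(u, i, j; θ_g) have already been computed). Setting g_ij to 0 recovers the pointwise term and setting it to 1 recovers the pairwise term, mirroring the continuum described above.

```python
import torch
import torch.nn.functional as F

def adaptive_loss(f_i, f_j, g_ij, y_i_gt_j, theta, theta_g,
                  lam=1e-4, lam_g=1e-4):
    # f_i, f_j: scores f(u, i; theta) and f(u, j; theta) per triplet.
    # g_ij:     mixing coefficients g(u, i, j; theta_g) in [0, 1].
    logits = f_i - g_ij * f_j            # logit of the modified p(i > j | u)
    bce = F.binary_cross_entropy_with_logits(logits, y_i_gt_j, reduction="sum")
    # Squared-L2 regularization on both parameter sets, as in the text.
    return bce + lam * theta.pow(2).sum() + lam_g * theta_g.pow(2).sum()
```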

In a first embodiment, as mentioned above, the mixing function is a constant hyperparameter. In alternative embodiments, however, the mixing function varies as a function of its parameters. In the following, concrete examples are given on how the mixing function g(u, i, j; θ_g) can be implemented.

In a second embodiment, a purely Collaborative-Filtering setting is considered, in which only the interactions of the users with the items are taken into account, when the attributes and the content (e.g., the title and the body of the item) of the items and the users are not available. In this context, attributes of an item denote specific aspects contained in or related to the item (e.g., through metadata).

For example, for items being documents like the news articles considered below in the experimental setup, the attributes of an item are, e.g., words, features, topics, entities, or any other specific aspect of the document. The content of an item denotes broadly the whole data contained in the item, e.g., the title and the body of the document, as well as additional metadata (such as date of creation, author(s), popularity, . . . ) if the item is a document.

Technically speaking, in the Collaborative-Filtering setting, the only available data consists of a sparse interaction matrix R, encoding the rating or the click action of a user u for an item i. In this case, the features associated to a user are the items she interacted with in the past, as represented by the corresponding row in the R matrix. Reciprocally, the features associated to an item are the users who interacted with it in the past, as represented by the corresponding column in the R matrix.

Using standard matrix factorization techniques applied to the interaction matrix R, the relevance score function ƒ(u, i; θ) takes the form:

ƒ(u,i;θ) = $\vec{u}^{\,T} \cdot \vec{i}$

where $\vec{u}$ and $\vec{i}$ (∈ $\mathbb{R}^k$) are the embeddings associated to u and i, respectively. By virtue of matrix factorization, an element R(u, i) of the interaction matrix R is given by R(u, i) ≅ $\vec{u}^{\,T} \cdot \vec{i}$.

In an embodiment, the embeddings are derived from the decomposition of the matrix R into two low-rank matrices. Alternatively, these embeddings are obtained as the output of two neural networks (e.g., multi-layer perceptrons), when using deep matrix factorization models.
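As an illustration of the first option, a truncated SVD is one simple way to decompose R into two low-rank factors whose rows serve as embeddings; the sketch below uses a random dense stand-in for the sparse matrix R and hypothetical dimensions.

```python
import torch

num_users, num_items, k = 1000, 500, 32
R = torch.randint(0, 2, (num_users, num_items)).float()  # stand-in interaction matrix

# Truncated SVD: R ~= U diag(S) V^T, kept at rank k.
U, S, V = torch.svd_lowrank(R, q=k)
user_emb = U * S.sqrt()        # rows are user embeddings u (num_users x k)
item_emb = V * S.sqrt()        # rows are item embeddings i (num_items x k)

# f(u, i; theta) = u^T . i, so that R(u, i) ~= u^T . i
score = user_emb[0] @ item_emb[3]
```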

In an embodiment, for this Collaborative-Filtering setting, the mixing function is defined by:

g(u,i,j;θ_g) = σ($\vec{i}^{\,T} \cdot W_g \cdot \vec{j}$)

where $\vec{j}$ is the embedding of item j.

In one embodiment, W_g is a k×k symmetric positive semi-definite matrix, typically of rank k′ much lower than the dimensionality k of $\vec{i}$. In other words, W_g is chosen as W_g = V_g^T V_g, with V_g a k′×k matrix. In this disclosure, “much lower” means one or more orders of magnitude lower. Typically, k′ is chosen to be of an order of magnitude of 10², e.g. k′=30, whereas k is chosen to be of an order of magnitude of 10³, e.g. k=300.

In this embodiment, the mixing function g can be interpreted as a generalized similarity measure between items i and j, where the metric is learned from the data. In further embodiments, more complex models are considered, such as a model taking into account the ranks of i and j or a model using more than one (linear) layer, provided that their complexity is compatible with the size and the distribution of the training data.

In an embodiment, V_g = θ_g is the set of learnable parameters defined above, which in that case is a matrix of dimension k′×k, which can be initialized with small random numbers generated, for example, with a Gaussian distribution around zero. In alternative embodiments, V_g can be a more complex function of θ_g.

In another embodiment, W_g is chosen as a diagonal matrix: W_g = diag(θ_g), where the set of learnable parameters θ_g is a vector of dimension k and W_g is a diagonal matrix whose diagonal is given by θ_g and whose other values are zero. In both cases, the goal is to constrain the W_g matrix to have many fewer parameters to learn.

Intuitively, this kind of mixing function materializes the fact that “negative” items that are more similar to a “positive” item are more valuable candidates than “far away” negative items to form training triplets. But the precise metric that defines this similarity is learned from the data.
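A minimal sketch of this low-rank mixing function, with hypothetical names, follows. Computing σ(i^T V_g^T V_g j) as a dot product of the two k′-dimensional projections V_g·i and V_g·j avoids ever forming the k×k matrix W_g.

```python
import torch

def mixing_g(i_emb, j_emb, V_g):
    # i_emb, j_emb: (batch, k) item embeddings; V_g: (k_prime, k), k_prime << k.
    # Implements g = sigmoid(i^T W_g j) with W_g = V_g^T V_g (symmetric PSD, low rank).
    zi = i_emb @ V_g.t()                       # (batch, k_prime)
    zj = j_emb @ V_g.t()
    return torch.sigmoid((zi * zj).sum(dim=-1))

k, k_prime = 300, 30                           # orders of magnitude from the text
V_g = (0.01 * torch.randn(k_prime, k)).requires_grad_()  # small Gaussian init
```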

In a third, more general embodiment, a purely Content-Based setting is considered, where the attributes and the content of the items and of the users are taken into account. A learnable user embedding function e_U is used, that maps user features into a more powerful representation, as well as a learnable item embedding function e_I that maps item features into the same space as the user embedding. This amounts to saying that the embeddings are defined by $\vec{u}$ = e_U(x_u; θ₁) and $\vec{i}$ = e_I(x_i; θ₂), where θ₁ and θ₂ are two sets of learnable parameters. These functions can take the form of simple linear transforms or deep neural networks.
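For illustration, the two embedding functions could be implemented as small neural networks mapping into a shared embedding space, as in the hypothetical PyTorch sketch below (the feature dimensions and layer sizes are made up for the example).

```python
import torch
import torch.nn as nn

class Embedder(nn.Module):
    """Maps raw features into a shared k-dimensional embedding space."""
    def __init__(self, in_dim, k):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, k))

    def forward(self, x):
        return self.net(x)

k = 300
e_U = Embedder(in_dim=1000, k=k)     # its parameters play the role of theta_1
e_I = Embedder(in_dim=5000, k=k)     # its parameters play the role of theta_2

u_vec = e_U(torch.randn(1, 1000))    # user embedding
i_vec = e_I(torch.randn(1, 5000))    # item embedding
score = (u_vec * i_vec).sum(dim=-1)  # f(u, i; theta) = u^T . i
```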

Similar to the above-mentioned Collaborative-Filtering setting, the relevance score function is given by ƒ(u, i; θ) = $\vec{u}^{\,T} \cdot \vec{i}$, and the following mixing function is used:

g(u,i,j;θ_g) = σ($\vec{i}^{\,T} \cdot W_g \cdot \vec{j}$)

with $\vec{j}$ = e_I(x_j; θ₂).

In one embodiment, W_g is a k×k symmetric positive semi-definite matrix, typically of rank k′ much lower than the dimensionality k of $\vec{i}$. In other words, W_g is chosen as W_g = V_g^T V_g, with V_g a k′×k matrix. The same typical orders of magnitude or values as the ones discussed above for the Collaborative-Filtering setting for k and k′ can also be used in the Content-Based setting.

In an embodiment, V_g = θ_g, where the set of learnable parameters θ_g is the matrix of dimension k′×k defined above, which again can be initialized, for example, with small values generated by a Gaussian distribution around zero. In alternative embodiments, V_g can be a more complex function of θ_g.

In another embodiment, W_g is chosen as a diagonal matrix: W_g = diag(θ_g), where the set of learnable parameters θ_g is a vector of dimension k and W_g is a diagonal matrix whose diagonal is given by θ_g and whose other values are zero. In both cases, the goal is to constrain the W_g matrix to have many fewer parameters to learn.

Alternatively, in all cases mentioned above, the mixing function can be expressed directly in terms of item features:

g(u,i,j;θ_g) = σ(x_i^T · W_g · x_j)

with the same symmetric positive semi-definite and low-rank constraints on W_g. As in the Collaborative-Filtering setting, more complex models could be envisaged as well, such as one including the ranks of i and j or one using more than one (linear) layer, provided that their complexity is compatible with the size and the distribution of the training data.

FIG. 1 is a functional block diagram illustrating a method of ranking items for a given entity. The method begins at step 102, at which a learnable scoring function ƒ is trained on a set of triplets <u, i, j>, each comprising an entity u (an entity is defined to be any user, context, or user query) from a set of entities (or a set of any users, contexts, or user queries, comprising any combination of users, contexts, and user queries) and a pair of items i and j from a set of items, wherein a relative relevance of items i and j for entity u is known for each triplet <u, i, j> of the set.

This training step aims at learning optimized values of a first set of learnable parameters θ of the learnable scoring function. This is achieved by optimizing a loss function depending on the first set of learnable parameters θ, on a second set of learnable parameters θ_g, and on a probability of having the item i preferred to the item j by the entity u, wherein the probability defines a continuum between pointwise ranking and pairwise ranking of items through a learnable mixing function depending on θ_g. θ and θ_g may be vectors or matrices of parameters, depending on the embodiments. Further technical details on this training step are illustrated in FIG. 2.

At step 104, the trained learnable scoring function ƒ (for which the optimized values of the first set of learnable parameters θ, on which ƒ depends, have now been determined in the training phase) is then applied (using the optimized values of θ) to all pairs <u′, i′> formed by a given entity u′ from the set of entities and by every item i′ from the set of items, to obtain a relevance score for each pair.

At step 106, all items i′ from the set of items are ranked for the entity u′ based on the respective relevance scores.

In an embodiment, the items are ranked by decreasing relevance score.

Alternatively, the items are ranked by increasing relevance score. Further, in an embodiment, the ranked items are provided (presented, sent, displayed, . . . ) to the entity u′. This achieves the goal of the “learning-to-rank” problem defined above.

In an embodiment, this ranking can be used to automatically control client devices, including, e.g., autonomous vehicles, robots, computers, or cell phones, as illustrated in FIG. 3.

In an embodiment, a controller of a client device like the above-mentioned ones takes the result of the ranking output at step 104 of the method of FIG. 1 to determine a control command to be performed by the client device.

FIG. 2 is a functional block diagram illustrating sub-steps of the method step 102 of FIG. 1 for training a learnable scoring function on a plurality of triplets.

At step 202, a (next) triplet <u, i, j> from the set of triplets defined in step 102 of FIG. 1 is taken as input, and a loop of steps 204 to 214 is performed so that steps 204 to 214 are applied to each input triplet until there are no remaining triplets in the set, as indicated by the condition defined in step 214 (at which it is determined whether there is still a remaining triplet in the set).

If it is determined at step 214 that there is no remaining triplet in the set, the value of the loss function can be computed (as a sum over all the triplets of the set) and gradient descent optimization can be performed at step 216 to optimize the values of the sets of learnable parameters θ and θ_g.

In more detail, as the learnable parameters are gradually adapted until convergence of the loss function towards its optimum, a loop over steps 202 to 216 (not shown) takes place until convergence of the loss function is achieved. When the loss function has converged to its optimum (minimum or maximum, as discussed above), the optimized values of θ and θ_g are obtained at (the last occurrence of) step 216 and the training phase of step 102 of FIG. 1 ends at step 218.

At step 204, a weighting coefficient g(u, i, j; θ_g) is computed by applying the learnable mixing function g to the entity u, the two items i and j, and the second set of learnable parameters θ_g. The weighting coefficient is a value within the interval [0,1], which will be applied as a weighting coefficient to the first relevance score ƒ(u, j; θ) at step 210.

In an embodiment, as mentioned above, the mixing function is a constant hyperparameter. In another embodiment, the mixing function is a learnable similarity function, i.e., a similarity function in which the precise metric that defines the similarity is learned from the data.

In an embodiment, the mixing function is defined by g(u, i, j; θ_g) = σ($\vec{i}^{\,T} \cdot W_g \cdot \vec{j}$), where σ is the sigmoid function (σ(x) = 1/(1+e^(−x))), and $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively.

In an embodiment, W_g is a k×k symmetric positive semi-definite matrix: W_g = V_g^T V_g, with V_g a k′×k matrix of rank(V_g) = k′ ≪ k, as defined above. Typical orders of magnitude or values for k and k′ have already been given above. In that case, the mixing function g can be interpreted as a generalized similarity measure between items i and j, where the metric is learned from the data. In further embodiments, more complex models are considered, such as a model taking into account the ranks of i and j or a model using more than one (linear) layer, provided that their complexity is compatible with the size and the distribution of the training data.

In an embodiment, V_g = θ_g, that is W_g = θ_g^T θ_g, where the second set of learnable parameters θ_g is a matrix of dimension k′×k. In other embodiments, V_g can be defined as a more complex function of θ_g.

In an alternative embodiment, W_g = diag(θ_g), where the second set of learnable parameters θ_g is a vector of dimension k and W_g is a diagonal matrix whose diagonal is given by θ_g and whose other values are zero.

In the general case, a purely Content-Based setting is considered (where the attributes and the content of the items and of the entities are taken into account, as discussed above).

In this embodiment, a learnable entity embedding function e_U is used, that maps entity features into a more powerful representation: $\vec{u}$ = e_U(x_u; θ₁), as well as a learnable item embedding function e_I that maps item features into the same space as the entity features: the item embeddings are thus defined by $\vec{i}$ = e_I(x_i; θ₂) and $\vec{j}$ = e_I(x_j; θ₂), where θ₁ and θ₂ are learnable parameters. These functions can take the form of simple linear transforms or deep neural networks.

In a more restrictive embodiment, compatible with all the definitions of g given above, a purely Collaborative-Filtering setting is considered (in which only the interactions of the entities with the items are taken into account, while the attributes and the content of the items and the entities are ignored, as discussed above).

In this embodiment, the only available data consists of a sparse interaction matrix R, encoding the rating or the click action of all the entities for all the items. In this embodiment, the embeddings of an entity u and an item i (or j) are derived from the decomposition of the matrix R into two low-rank matrices. Alternatively, these embeddings are obtained as the output of two neural networks (e.g., multi-layer perceptrons), when using deep matrix factorization models.

Unless explicitly defined as alternative embodiments, any combination of the embodiments described above for defining the mixing function, the embeddings, and the matrix W_g can be utilized.

At steps 206 and 208, a first relevance score ƒ(u, j; θ) and a second relevance score ƒ(u, i; θ) are computed by applying the learnable scoring function ƒ to the entity u and to the items j and i, respectively. In an embodiment, the relevance score for an item i is given by ƒ(u, i; θ) = $\vec{u}^{\,T} \cdot \vec{i}$. In another embodiment, the relevance score for an item i is given by a bilinear form (similar to the one used for the mixing function in the embodiments described above in relation with step 204): ƒ(u, i; θ) = $\vec{u}^{\,T} \cdot W \cdot \vec{i}$, where W is a k×k symmetric positive semi-definite matrix and k is the dimension of the embeddings $\vec{u}$ and $\vec{i}$.

In an embodiment, W = θ^T θ, where the first set of learnable parameters θ is a matrix of dimension k′×k, where k′ is one or more orders of magnitude lower than k. In an alternative embodiment, W = diag(θ), where θ is a vector of dimension k and W is a diagonal matrix whose diagonal is given by θ and whose other values are zero.

More generally, in a formulation that encompasses the above alternative embodiments, the learnable scoring function is given by ƒ(u, i; θ) = Matching(e_U(x_u; θ₁), e_I(x_i; θ₂)), where e_U is a learnable entity embedding function that maps entity features x_u into an entity embedding $\vec{u}$ = e_U(x_u; θ₁), e_I is a learnable item embedding function that maps item features x_i into an item embedding $\vec{i}$ = e_I(x_i; θ₂), and θ₁ and θ₂ are learnable parameters.

The Matching function acts as a similarity measure between the entity embedding and the item embedding that outputs a relevance score for the item corresponding to the item embedding. In an embodiment, this similarity measure increases with the relevance of the item for the entity corresponding to the entity embedding.

In the general case, compatible with all the definitions of ƒ given above, where a purely Content-Based setting is considered, the embeddings are defined by $\vec{i}$ = e_I(x_i; θ₂) and $\vec{j}$ = e_I(x_j; θ₂), as described above in relation with the computation of the mixing function in the Content-Based setting embodiment in step 204.

In the more restrictive embodiment, also compatible with all the definitions of ƒ given above, where a purely Collaborative-Filtering setting is considered, the embeddings $\vec{u}$ and $\vec{i}$ are derived from the decomposition of the sparse interaction matrix R into two low-rank matrices, as described above in relation with the computation of the mixing function in the Collaborative-Filtering setting embodiment in step 204.

At step 210, the weighting coefficient g(u, i, j; θ_g) computed at step 204 is applied to the first relevance score ƒ(u, j; θ) computed at step 206. In an embodiment, the first relevance score ƒ(u, j; θ) computed at step 206 is multiplied (or “weighted”) by the weighting coefficient g(u, i, j; θ_g) computed at step 204.

At step 212, the (estimated) probability of having the item i preferred to the item j by the entity u is computed as a function of a difference between the second relevance score and the weighted first relevance score. In an embodiment, this probability is computed as:

p(i>j|u) = σ(ƒ(u,i;θ) − g(u,i,j;θ_g)·ƒ(u,j;θ))

where σ is the sigmoid function.

At step 214, it is determined whether there is a remaining triplet <u, i, j> in the set. If it is the case, the previous steps 202 to 212 are repeated for the next triplet, in order to compute the probability of having the item i preferred to the item j by the entity u (as computed at step 212) for each triplet of the set.

Otherwise, the method continues at step 216, at which the known technique of gradient descent optimization is used to gradually optimize the values of the first and second sets of learnable parameters θ and θ_g, as mentioned above. This gradual optimization takes place over a plurality of iterations of the previous steps 202 to 216 (not shown), until these parameters reach their optimized values. This happens when the loss function used for the gradient descent optimization converges towards its optimum value.

In an embodiment, the loss function is a negative log-likelihood loss function and optimizing the loss function comprises minimizing the loss function over the training set of entities and items. An exemplary loss function corresponding to this embodiment has been given above. However, in alternative embodiments, the loss function can be redefined to converge to a maximum value.

Finally, when the loss function has reached its optimum, the optimized values of the first and second sets of parameters θ and θ_g are obtained, and the training phase of step 102 of FIG. 1 ends at step 218 of FIG. 2.

It is noted that the order and the number of the steps of FIG. 2 are merely illustrative and are not meant to be restrictive; all the alternative embodiments that a skilled person would consider in order to implement the same general idea are also contemplated without departing from the scope of the present disclosure.

For example, the loop performed over the input triplets, or the computation in steps 204 to 210 of the different terms necessary to compute the probability in step 212, can be implemented according to any alternative formulation or ordering that would fall within the usual skills of a person skilled in the art.
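To make the training flow of FIG. 2 concrete, the following is a minimal, self-contained PyTorch sketch of steps 202 to 216 under simplifying assumptions: a Collaborative-Filtering setting where the embeddings themselves are the learnable parameters, a dot-product scoring function, a low-rank mixing function, and a single randomly generated triplet batch standing in for the triplet set. All names and dimensions are hypothetical.

```python
import torch
import torch.nn.functional as F

num_users, num_items, k, k_prime = 100, 50, 16, 4
user_emb = torch.randn(num_users, k, requires_grad=True)  # part of theta
item_emb = torch.randn(num_items, k, requires_grad=True)  # part of theta
V_g = (0.01 * torch.randn(k_prime, k)).requires_grad_()   # theta_g (W_g = V_g^T V_g)

opt = torch.optim.Adam([user_emb, item_emb, V_g], lr=1e-3)

# Stand-in triplet set <u, i, j> with preference labels y_{u,i>j}.
u_idx = torch.randint(0, num_users, (256,))
i_idx = torch.randint(0, num_items, (256,))
j_idx = torch.randint(0, num_items, (256,))
y = torch.randint(0, 2, (256,)).float()

for step in range(100):                                   # loop of steps 202-216
    u_e, i_e, j_e = user_emb[u_idx], item_emb[i_idx], item_emb[j_idx]
    f_i = (u_e * i_e).sum(-1)                             # step 208: f(u, i; theta)
    f_j = (u_e * j_e).sum(-1)                             # step 206: f(u, j; theta)
    zi, zj = i_e @ V_g.t(), j_e @ V_g.t()
    g_ij = torch.sigmoid((zi * zj).sum(-1))               # step 204: g(u, i, j; theta_g)
    logits = f_i - g_ij * f_j                             # steps 210-212: p(i>j|u) logit
    loss = F.binary_cross_entropy_with_logits(logits, y, reduction="sum")
    opt.zero_grad()
    loss.backward()                                       # step 216: gradient descent
    opt.step()
```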

In the following discussion, a user may be any user, context, or user query or “entity” as defined above. The above concepts are now exemplarily applied to a news recommendation use case, where items are news articles.

A purely Content-Based approach is adopted for solving this recommendation problem: item features consist of an unsupervised embedding of the bag of words of the news' content, while user features are derived from previous interactions with the system.

A user's action history at time t with N interactions is denoted as S_t^u = (I₁^u, I₂^u, . . . , I_N^u). Each interaction of user u consists of the user's (implicit) feedback on a list of K recommended items. This interaction is defined by the list I^u = ((i₁, y_{u,i₁}), . . . , (i_k, y_{u,i_k}), . . . , (i_K, y_{u,i_K})), which contains clicked (y_{u,i_k} = 1) and non-clicked (y_{u,i_k} = 0) items from a list of K recommended items.

A user is represented by an embedding derived from previous interactions (clicked and non-clicked items) in the following way:

$\vec{u}_t = \mu_{u_t}^{+} - \beta \odot \mu_{u_t}^{-}$

$\mu_{u_t}^{+} = \frac{1}{|\{i \mid y_{u,i}=1\}|} \sum_{\{i \mid y_{u,i}=1\}} e_I(x_i;\theta_2)$

$\mu_{u_t}^{-} = \frac{1}{|\{i \mid y_{u,i}=0\}|} \sum_{\{i \mid y_{u,i}=0\}} e_I(x_i;\theta_2)$

where μ_{u_t}^+ is a vector of dimension k defined by the mean of the user u's clicked items' embeddings at time t (user positive centroid) and μ_{u_t}^− is a vector of dimension k defined by the mean of the user u's non-clicked items' embeddings at time t (user negative centroid). β is a vector of learnable parameters of dimension k that quantifies to which extent the non-clicked items are useful to characterize user preferences (i.e., β scales the user negative centroid), and ⊙ denotes the element-wise product of two vectors. The notation of time t is dropped when it is clear from the context. It is assumed, in these equations, that ∥e_I(x_i; θ₂)∥ = 1 for all i.
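A minimal sketch of this user representation, with hypothetical names, follows; it assumes the item embeddings are already L2-normalized and that the history contains at least one clicked and one non-clicked item.

```python
import torch

def user_embedding(item_embs, clicks, beta):
    # item_embs: (n, k) embeddings e_I(x_i; theta_2), assumed L2-normalized.
    # clicks:    (n,) tensor of y_{u,i} in {0, 1}.
    # beta:      (k,) learnable scaling of the negative centroid.
    mu_pos = item_embs[clicks == 1].mean(dim=0)   # positive centroid
    mu_neg = item_embs[clicks == 0].mean(dim=0)   # negative centroid
    return mu_pos - beta * mu_neg                 # element-wise product with beta

k = 300
beta = torch.zeros(k, requires_grad=True)         # hypothetical initialization
```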

The learnable scoring function of an item i for a user u is defined by a simple bilinear form:

ƒ(u,i;θ) = $\vec{i}^{\,T} W \vec{u}$

using a diagonal weight matrix W = diag(θ), where θ is a vector of dimension k. This amounts to considering a generalized dot product between the user's and the item's representations, where the scaling factor (or weight) for each dimension is learned from the data. At the beginning of the training phase, θ is initialized to 1 (i.e., more precisely, to a vector of dimension k with all its values initialized to 1).

In this exemplary implementation, the mixing function g(u, i, j; θ_g) is defined as the sigmoid of a bilinear form:

g(u,i,j;θ_g) = σ($\vec{i}^{\,T} W_g \vec{j}$)

where W_g = diag(θ_g) is a diagonal weight matrix and θ_g is a vector of dimension k. This amounts to considering a generalized dot product between the representations of the two items, but where the scaling factors (or weights) are also learned from the data and could be different from the relevance score function parameters defined by the vector θ. At the beginning of the training phase, θ_g is initialized to 1 (i.e., more precisely, to a vector of dimension k with all its values initialized to 1).
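These two diagonal bilinear forms reduce to weighted dot products, as in the following sketch (hypothetical names; the initialization to a vector of ones is the one described above).

```python
import torch

k = 300
theta = torch.ones(k, requires_grad=True)     # W = diag(theta)
theta_g = torch.ones(k, requires_grad=True)   # W_g = diag(theta_g)

def f(u_emb, i_emb):
    # f(u, i; theta) = i^T diag(theta) u: a per-dimension weighted dot product.
    return (i_emb * theta * u_emb).sum(dim=-1)

def g(i_emb, j_emb):
    # g(u, i, j; theta_g) = sigmoid(i^T diag(theta_g) j).
    return torch.sigmoid((i_emb * theta_g * j_emb).sum(dim=-1))
```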

The AIRS news dataset consists of different genres of Korean news: entertainment, sports, and general. For each news category, users' news reading activity logs for 7 days are available. User news reading activities are composed of news article impressions and clicks on recommended news. The recommended news article list is composed of twenty-eight articles for entertainment and sports news, and eight for general news. For the experimental setup, the users' activities taken into account have been restricted to the impressions with click information.

News article texts were lemmatized by using the KoNLPy toolkit, and lemmas appearing less than three times were discarded. In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning, the lemma being the canonical form common to a set of words having the same meaning. Segmentation of words was not needed, since the news articles are composed of words separated by blank spaces.

After preprocessing, the resulting vocabulary size is 40,000 for entertainment news, 30,000 for sports news, and 61,000 for general news.

The data is split into training, validation, and test sets in temporal order: the first three days form the training set, the fourth day forms the validation set, and the last three days form the test set. The experiments are repeated for each model five times, each run being randomly initialized with a predefined seed specific to that run. For each run, 1000 users are selected at random (with the predefined seed initialization of this specific run). It is ensured that each user has at least one click on the first four days (training and validation sets), and at least one click on the last three days (test set).

The ranking approach described above in this disclosure is compared with pointwise, pairwise, and listwise ranking approaches. The model parameters are optimized through stochastic back-propagation with an adaptive learning rate for each parameter, with the binary cross-entropy loss on the output. The mini-batch size (i.e., the number of instances used for calculating the weight update at each iteration) is chosen to be 128.

For regularization, L₂ norm regularization (weight decay) and batch normalization are applied (where the L₂ penalty is the sum of the squared weight parameters, and batch normalization normalizes the input of a layer by adjusting and scaling the activations).

For hyperparameter tuning (i.e., the process of finding the values of the hyperparameters λ, plus the constant γ in the embodiment mentioned above where the mixing function is chosen to be a constant hyper-parameter), a grid search over the learning rate values of {10⁻², 10⁻³, 10⁻⁴} has been used, together with the regularization coefficient values of {10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶}.

The hyperparameter values are automatically chosen, the chosen hyperparameter setting being the one which leads to the best average normalized Discounted Cumulative Gain (NDCG) score of five runs on the validation set. The NDCG is a ranking measure that penalizes highly relevant documents appearing lower in a search result list, the penalty applied to these documents being logarithmically proportional to the position of the document in the search result list.
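For reference, NDCG can be computed as in the following sketch (plain Python, hypothetical names), which divides the log-discounted cumulative gain of the predicted ranking by that of the ideal ranking.

```python
import math

def ndcg_at_n(relevances, n):
    # relevances: ground-truth relevance labels in predicted rank order.
    dcg = sum(rel / math.log2(pos + 2)           # log-discount; positions are 0-based
              for pos, rel in enumerate(relevances[:n]))
    ideal = sorted(relevances, reverse=True)     # best possible ordering
    idcg = sum(rel / math.log2(pos + 2)
               for pos, rel in enumerate(ideal[:n]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_n([0, 1, 0, 1], n=4))              # example: clicks at ranks 2 and 4
```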

The respective ranking performances of the proposed ranking approach (according to the experimental setting described above) and of the baseline ranking approaches are now assessed in terms of the hit ratio (HR), being the ratio of the number of clicks on an item over the total number of clicks in the test data, the NDCG (defined above), and the mean Reciprocal Rank (MRR), being the mean of the multiplicative inverse of the rank of the first clicked item.
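Similarly, a sketch of MRR (plain Python, hypothetical names): for each recommendation list it takes the reciprocal of the 1-based rank of the first clicked item, then averages over lists.

```python
def mrr(lists_of_clicks):
    # lists_of_clicks: one list of 0/1 click labels per recommendation list,
    # in predicted rank order.
    reciprocal_ranks = []
    for clicks in lists_of_clicks:
        first = next((pos for pos, c in enumerate(clicks) if c == 1), None)
        reciprocal_ranks.append(0.0 if first is None else 1.0 / (first + 1))
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print(mrr([[0, 1, 0], [1, 0, 0]]))   # (1/2 + 1/1) / 2 = 0.75
```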

The average score of five runs and the standard deviation on the test set are reported in terms of these ranking measures to obtain a better generalization estimate of the ranking performances. Furthermore, the statistical significance of the results is assessed by using the Wilcoxon signed-rank test (a non-parametric paired difference test) with Bonferroni correction (which is used to adjust confidence intervals to counteract the problem of multiple comparisons). The recommendation list length (i.e., how many documents are recommended to the user in the list of recommendations obtained as output of the models) for each model is given after a “@” in Tables 1, 2 and 3, as illustrated in FIGS. 4-6.

The proposed ranking model is then compared with pointwise, pairwise, and listwise ranking models on the AIRS news dataset. The results for personalized news recommendation on AIRS entertainment news are shown in Table 1 (FIG. 4), and the results for personalized news recommendation on AIRS sports and general news are shown in Table 2 (FIG. 5).

Table 3 (FIG. 6) shows general results for all types of news on the AIRS news dataset. The best scores are shown in bold, and the baseline model score is marked with “*” when the best model (in bold) is significantly better than the baseline model at the 5% level, and with “**” at the 1% level. The 5% and 1% thresholds are corrected by using the Bonferroni correction.

As illustrated in FIG. 4, Table 1 shows the results on AIRS entertainment news. The best results (maximum) are in bold. “*” indicates that a model is significantly worse than the best method according to a Wilcoxon signed-rank test using Bonferroni correction at 5%, and “**” at 1%.

More specifically, Table 1 shows the personalized news recommendation results on the AIRS entertainment news dataset in terms of HR@1, NDCG@8, NDCG@28, and MRR@28 scores. On the entertainment dataset, adaptive pointwise-pairwise g ranking significantly improves over pointwise, pairwise, and listwise ranking in terms of HR@1, and it significantly improves over all other ranking approaches in terms of NDCG@8, NDCG@28, and MRR@28. The pairwise ranking leads to the lowest HR, NDCG, and MRR scores on AIRS entertainment news.

As illustrated in FIG. 5, Table 2 shows the results on AIRS sports news. The best results (maximum) are in bold. “*” indicates that a model is significantly worse than the best method according to a Wilcoxon signed-rank test using Bonferroni correction at 5%, and “**” at 1%.

More specifically, Table 2 illustrates the personalized news recommendation performances of different ranking approaches on the AIRS sports news dataset. Similarly to what was observed in Table 1, on the sports news recommendation dataset adaptive pointwise-pairwise g achieves the highest HR@1, NDCG@8, NDCG@28, and MRR@28 scores. Adaptive pointwise-pairwise g achieves significantly better results than pointwise, pairwise, and listwise ranking in terms of NDCG@8, NDCG@28, and MRR@28, and it also achieves significantly better results than pairwise and listwise ranking in terms of HR@1. The pointwise ranking model is the leading model among the baseline ranking approaches, followed by pairwise and listwise ranking, respectively.

As illustrated in FIG. 6, Table 3 shows the results on AIRS general news. The best results (maximum) are in bold. “*” indicates that a model is significantly worse than the best method according to a Wilcoxon signed-rank test using Bonferroni correction at 5%, and “**” at 1%.

More specifically, Table 3 gives the personalized news recommendation results of the adaptive pointwise-pairwise and baseline ranking approaches on AIRS general news. Adaptive pointwise-pairwise g ranking leads to the best scores in terms of HR@1, NDCG@8, and MRR@8 (the recommendation list length is 8 for general news), which are significantly better than those of the rest of the ranking approaches.

Although the above embodiments have been described in the context of method steps, they also represent a description of a corresponding component, module or feature of a corresponding apparatus or system.

Some or all of the method steps may be implemented by a computer in that they are executed by (or using) a processor, a microprocessor, an electronic circuit or processing circuitry.

The embodiments described above may be implemented in hardware or in software. The implementation can be performed using a non-transitory storage medium such as a computer-readable storage medium, for example a floppy disc, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory. Such computer-readable media can be any available media that can be accessed by a general-purpose or special-purpose computer system.

Generally, embodiments can be implemented as a computer program product with a program code or computer-executable instructions, the program code or computer-executable instructions being operative for performing one of the methods when the computer program product runs on a computer. The program code or the computer-executable instructions may, for example, be stored on a computer-readable storage medium.

In an embodiment, a storage medium (or a data carrier, or a computer-readable medium) comprises, stored thereon, the computer program or the computer-executable instructions for performing one of the methods described herein when it is performed by a processor. In a further embodiment, an apparatus comprises one or more processors and the storage medium mentioned above.

In a further embodiment, an apparatus comprises means, for example processing circuitry like, e.g., a processor communicating with a memory, the means being configured to, or adapted to, perform one of the methods described herein.

A further embodiment comprises a computer having installed thereon the computer program or instructions for performing one of the methods described herein.

The above-mentioned methods and embodiments may be implemented within an architecture such as illustrated in FIG. 3, which comprises server 300 and one or more client devices 302 that communicate over a network 304 (which may be wireless and/or wired) such as the Internet for data exchange. Server 300 and the client devices 302 include a data processor 312 and memory 313 such as a hard disk. The client devices 302 may be any device that communicates with server 300, including autonomous vehicle 302b, robot 302c, computer 302d, or cell phone 302e.

More specifically, in one embodiment, a user of the client device 302 requests (or is automatically provided with) a list of recommendations as output of the ranking method described above in relation with FIG. 1. The training phase (step 102 of FIG. 1) as well as the application of the trained learnable scoring function to answer the request (step 104 of FIG. 1) may be processed at server 300 (or at a different server, or alternatively directly at the client device 302), and a resulting list of recommended items is sent back to the client device 302, where it can be displayed or provided to the user in any other way.

It is finally noted that, in an embodiment compatible with all the embodiments described above, the methods described in relation with FIGS. 1 and 2 can be used to automatically control any client device like the client devices 302, including, e.g., autonomous vehicle 302b, robot 302c, computer 302d, or cell phone 302e, among other types of client devices known in the field.

As described above, a computer-implemented method of constructing a ranking model for an information retrieval system by ranking items for a given entity with a learnable scoring function, wherein an entity is one of a user, a query, or a context, comprises: (a) training a learnable scoring function ƒ, which depends on a first set of learnable parameters θ, on a set of triplets <u, i, j>, the set of triplets comprising an entity u from a set of entities and a pair of items i and j from a set of items, wherein a relative relevance of each pair of items i and j from the set of items for each entity u from the set of entities is known, to learn optimized values of the first set of learnable parameters θ; the training of the learnable scoring function ƒ including (a1) optimizing a loss function depending on the first set of learnable parameters θ and on a second set of learnable parameters θ_g, wherein the loss function defines a continuum between pointwise ranking and pairwise ranking of items through its dependency on a learnable mixing function g depending on θ_g; (b) applying, using the learned optimized values of the first set of learnable parameters θ, the trained learnable scoring function ƒ to all input pairs <u′, i′> formed by a given entity u′ from the set of entities and by every item i′ from the set of items to obtain a relevance score for each pair; (c) ranking the items i′ for the entity u′ based on the respective relevance scores; and (d) constructing, using the ranked items i′ for the entity u′, a ranking model for an information retrieval system.

The training of the learnable scoring function ƒ may further comprise: (a2) computing, for each triplet <u, i, j> of the set of triplets, a weighting coefficient g(u, i, j; θ_g) ∈ [0,1] by applying the learnable mixing function g to the entity u, the two items i and j, and the second set of learnable parameters θ_g; (a3) computing, for each triplet <u, i, j> of the set of triplets, a first relevance score ƒ(u, j; θ) defining a relevance of the item j for the entity u, by applying the learnable scoring function ƒ to the item j and the entity u; (a4) computing, for each triplet <u, i, j> of the set of triplets, a second relevance score ƒ(u, i; θ) defining a relevance of the item i for the entity u, by applying the learnable scoring function ƒ to the item i and the entity u; (a5) applying, for each triplet <u, i, j> of the set of triplets, the weighting coefficient g(u, i, j; θ_g) to the first relevance score ƒ(u, j; θ); (a6) computing, for each triplet <u, i, j> of the set of triplets, a probability of having the item i preferred to the item j by the entity u as a function of a difference between the second relevance score ƒ(u, i; θ) and the weighted first relevance score, wherein the computed probability corresponds to a pairwise relevance probability of having the item i preferred to the item j by the entity u if g(u, i, j; θ_g) = 1, to a pointwise relevance probability defining how relevant the item i is to the entity u if g(u, i, j; θ_g) = 0, and defines the continuum between pointwise ranking and pairwise ranking of items if 0 < g(u, i, j; θ_g) < 1; and (a7) learning optimized values of the first and second sets of learnable parameters θ and θ_g by optimizing the loss function, depending on θ and θ_g, through gradient descent optimization, the loss function being defined as a sum over all triplets <u, i, j> of a function derived from the probability of having the item i preferred to the item j by the entity u.
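A minimal sketch of steps (a2)-(a6), assuming ƒ and g are supplied as Python callables (hypothetical names), computes the adaptive preference probability for one triplet; g = 1 recovers the pairwise probability and g = 0 the pointwise one.

```python
import numpy as np

def preference_probability(f, g, u, i, j, theta, theta_g):
    """P(i preferred to j by u) = sigma(f(u,i;theta) - g * f(u,j;theta))."""
    w = g(u, i, j, theta_g)       # (a2) weighting coefficient in [0, 1]
    score_j = f(u, j, theta)      # (a3) first relevance score
    score_i = f(u, i, theta)      # (a4) second relevance score
    diff = score_i - w * score_j  # (a5) weight f(u,j) and take the difference
    return 1.0 / (1.0 + np.exp(-diff))  # (a6) sigmoid of the difference
```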

The function of the difference between the second relevance score and the weighted first relevance score may be a sigmoid function.

The loss function may be a negative log-likelihood loss function, and optimizing the loss function may minimize the loss function over the set of entities and the set of items.

The negative log-likelihood loss function may be defined by:

$$L_{adaptive}(\theta,\theta_{g}) = -\sum_{(u,i,j)\in D}\Big(y_{u,i>j}\,\log\sigma\big(f(u,i;\theta)-g(u,i,j;\theta_{g})\,f(u,j;\theta)\big)+\big(1-y_{u,i>j}\big)\log\big(1-\sigma\big(f(u,i;\theta)-g(u,i,j;\theta_{g})\,f(u,j;\theta)\big)\big)\Big),$$

wherein D is the set of triplets <u, i, j>, σ is the sigmoid function, ƒ(u, i; θ) is the learnable scoring function that assigns a relevance score to the item i for the entity u, and $y_{u,i>j}$ is a variable defined by $y_{u,i>j} = 1$ when entity u prefers item i over j, and $y_{u,i>j} = 0$ otherwise.
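The loss can be sketched directly from this formula. The illustrative NumPy version below (the inputs `triplets`, `labels`, and the callables `f` and `g` are hypothetical) sums the negative log-likelihood over D; in practice the optimization of θ and θ_g would be carried out by an automatic-differentiation framework, and this sketch only shows the forward computation.

```python
import numpy as np

def adaptive_loss(f, g, theta, theta_g, triplets, labels, eps=1e-12):
    """L_adaptive(theta, theta_g): negative log-likelihood over D, where
    labels[n] = y_{u,i>j} is 1 if entity u prefers item i over item j."""
    loss = 0.0
    for (u, i, j), y in zip(triplets, labels):
        diff = f(u, i, theta) - g(u, i, j, theta_g) * f(u, j, theta)
        p = 1.0 / (1.0 + np.exp(-diff))  # sigma(f_i - g * f_j)
        loss -= y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)
    return loss
```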

The learnable scoring function may be given by ƒ(u, i; θ) = Matching(e_U(x_u; θ₁), e_I(x_i; θ₂)), where e_U is a learnable entity embedding function that maps entity features x_u into an entity embedding $\vec{u}$ = e_U(x_u; θ₁) and e_I is a learnable item embedding function that maps item features x_i into an item embedding $\vec{i}$ = e_I(x_i; θ₂), where θ₁ and θ₂ are learnable parameters, and where the Matching function acts as a similarity measure between the entity embedding and the item embedding that outputs a relevance score for the item corresponding to the item embedding, wherein the relevance score increases with the relevance of the item for the entity corresponding to the entity embedding.
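For illustration, a minimal sketch of this architecture, assuming e_U and e_I are supplied as callables returning k-dimensional NumPy vectors and using a plain dot product as one possible Matching function (all names are hypothetical):

```python
import numpy as np

def f_matching(x_u, x_i, theta1, theta2, e_U, e_I):
    """f(u, i; theta) = Matching(e_U(x_u; theta1), e_I(x_i; theta2))."""
    u_vec = e_U(x_u, theta1)  # entity embedding, a vector of dimension k
    i_vec = e_I(x_i, theta2)  # item embedding, a vector of dimension k
    return np.dot(u_vec, i_vec)  # dot product as the Matching function
```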

The learnable scoring function may be given by ƒ(u, i; θ) = $\vec{u}^T \cdot \vec{i}$.

The learnable scoring function may be given by ƒ(u, i; θ) = $\vec{u}^T W \vec{i}$; wherein W = θ^T θ is a k×k symmetric positive semi-definite matrix; k is the dimension of the embeddings $\vec{u}$ and $\vec{i}$; the first set of learnable parameters θ is a matrix of dimension k′×k; and k′ is an integer one or more orders of magnitude lower than k.

The learnable scoring function may be given by ƒ(u, i; θ) = $\vec{u}^T W \vec{i}$; wherein W = diag(θ) is a k×k diagonal matrix whose diagonal is given by θ and whose other values are zero; k is the dimension of the embeddings $\vec{u}$ and $\vec{i}$; and the first set of learnable parameters θ is a vector of dimension k.
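The three variants above (plain dot product, low-rank W = θ^T θ, and diagonal W = diag(θ)) can be sketched as follows; this is an illustrative NumPy sketch with hypothetical names u_vec, i_vec, and theta, not a definitive implementation. The low-rank form keeps the parameter count at k′·k, and the diagonal form at k, instead of the full k².

```python
import numpy as np

def score_dot(u_vec, i_vec):
    """f(u, i; theta) = u^T . i (no parameters beyond the embeddings)."""
    return u_vec @ i_vec

def score_low_rank(u_vec, i_vec, theta):
    """f(u, i; theta) = u^T W i, with W = theta^T theta, theta of shape
    (k', k), k' << k, so W is k x k symmetric positive semi-definite."""
    W = theta.T @ theta
    return u_vec @ W @ i_vec

def score_diagonal(u_vec, i_vec, theta):
    """f(u, i; theta) = u^T diag(theta) i, with theta a vector of size k."""
    return u_vec @ (theta * i_vec)  # diag(theta) @ i equals theta * i
```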

Under a Collaborative-Filtering setting making use only of entity-item interactions, the only available data for computing $\vec{u}$ and $\vec{i}$ may consist of a sparse interaction matrix R encoding a rating or a click action of an entity u for an item i.
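For example (an illustrative sketch with toy data, assuming SciPy is available), such a sparse interaction matrix can be stored in compressed sparse row format, with R[u, i] = 1 encoding a click of entity u on item i:

```python
from scipy.sparse import csr_matrix

# Toy data, purely for illustration: 3 entities, 4 items; entity 0 clicked
# items 1 and 3, entity 1 clicked item 0, entity 2 clicked item 2.
rows = [0, 0, 1, 2]
cols = [1, 3, 0, 2]
vals = [1, 1, 1, 1]
R = csr_matrix((vals, (rows, cols)), shape=(3, 4))
```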

The learnable mixing function g may be a learnable similarity function.

The learnable mixing function g may be a constant hyper-parameter.

The learnable mixing function may be defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, W_g = θ_g^T θ_g is a k×k symmetric positive semi-definite matrix, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, the second set of learnable parameters θ_g is a matrix of dimension k′×k, and k′ is an integer one or more orders of magnitude lower than k.

The learnable mixing function may be defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, W_g = diag(θ_g) is a k×k diagonal matrix whose diagonal is given by θ_g and whose other values are zero, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, and the second set of learnable parameters θ_g is a vector of dimension k.
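Both forms of the mixing function can be sketched as follows (illustrative only; i_vec, j_vec, and theta_g are hypothetical names for the item embeddings and the parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_low_rank(i_vec, j_vec, theta_g):
    """g = sigma(i^T W_g j), with W_g = theta_g^T theta_g and theta_g of
    shape (k', k), k' one or more orders of magnitude lower than k."""
    return sigmoid(i_vec @ (theta_g.T @ theta_g) @ j_vec)

def g_diagonal(i_vec, j_vec, theta_g):
    """g = sigma(i^T diag(theta_g) j), theta_g a vector of dimension k."""
    return sigmoid(i_vec @ (theta_g * j_vec))
```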

The entity embedding at time t may be given by $\vec{u}_t = \mu_{u_t}^{+} - \beta \odot \mu_{u_t}^{-}$, wherein $\mu_{u_t}^{+}$ is the mean of the embeddings of the items clicked by the entity u up to time t and $\mu_{u_t}^{-}$ is the mean of the embeddings of the items non-clicked by the entity u up to time t over a list of K items recommended to the entity, β is a vector of learnable parameters of dimension k that quantifies to which extent the non-clicked items are useful to characterize entity preferences, and ⊙ is the element-wise product of two vectors.
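A minimal sketch of this update (illustrative only; `clicked` and `non_clicked` are hypothetical arrays whose rows are the embeddings of the items observed up to time t):

```python
import numpy as np

def entity_embedding(clicked, non_clicked, beta):
    """u_t = mu_u_t^+ - beta (element-wise product) mu_u_t^-."""
    mu_pos = clicked.mean(axis=0)      # mean embedding of clicked items
    mu_neg = non_clicked.mean(axis=0)  # mean embedding of non-clicked items
    return mu_pos - beta * mu_neg      # beta: learnable vector of dimension k
```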

It will be appreciated that variations of the above-disclosed embodiments and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also, various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the description above and the following claims.

CLAIMS

1. A computer-implemented method of constructing a ranking model for an information retrieval system by ranking items for a given entity with a learnable scoring function, wherein an entity is one of a user, a query, or a context, the method comprising: (a) training a learnable scoring function ƒ, which depends on a first set of learnable parameters θ and on a set of triplets <u, i, j>, the set of triplets comprising an entity u from a set of entities and a pair of items i and j from a set of items, wherein a relative relevance of each pair of items i and j from the set of items for each entity u from the set of entities is known, to learn optimized values of the first set of learnable parameters θ; said training the learnable scoring function ƒ including (a1) optimizing a loss function depending on the first set of learnable parameters θ and on a second set of learnable parameters θ_g, wherein the loss function defines a continuum between pointwise ranking and pairwise ranking of items through its dependency on a learnable mixing function g depending on θ_g; (b) applying, using the learned optimized values of the first set of learnable parameters θ, the trained learnable scoring function ƒ to all input pairs <u′, i′> formed by a given entity u′ from the set of entities and by every item i′ from the set of items to obtain a relevance score for each pair; (c) ranking the items i′ for the entity u′ based on the respective relevance scores; and (d) constructing, using the ranked items i′ for the entity u′, a ranking model for an information retrieval system.
2. The method as claimed in claim 1, wherein said training the learnable scoring function ƒ further comprises: (a2) computing, for each triplet <u, i, j> of the set of triplets, a weighting coefficient g(u, i, j; θ_g) ∈ [0,1] by applying the learnable mixing function g to the entity u, the two items i and j, and the second set of learnable parameters θ_g; (a3) computing, for each triplet <u, i, j> of the set of triplets, a first relevance score ƒ(u, j; θ) defining a relevance of the item j for the entity u, by applying the learnable scoring function ƒ to the item j and the entity u; (a4) computing, for each triplet <u, i, j> of the set of triplets, a second relevance score ƒ(u, i; θ) defining a relevance of the item i for the entity u, by applying the learnable scoring function ƒ to the item i and the entity u; (a5) applying, for each triplet <u, i, j> of the set of triplets, the weighting coefficient g(u, i, j; θ_g) to the first relevance score ƒ(u, j; θ); (a6) computing, for each triplet <u, i, j> of the set of triplets, a probability of having the item i preferred to the item j by the entity u as a function of a difference between the second relevance score ƒ(u, i; θ) and the weighted first relevance score, wherein the computed probability corresponds to a pairwise relevance probability of having the item i preferred to the item j by the entity u if g(u, i, j; θ_g) = 1, to a pointwise relevance probability defining how relevant the item i is to the entity u if g(u, i, j; θ_g) = 0, and defines said continuum between pointwise ranking and pairwise ranking of items if 0 < g(u, i, j; θ_g) < 1; and (a7) learning optimized values of the first and second sets of learnable parameters θ and θ_g by optimizing the loss function, depending on θ and θ_g, through gradient descent optimization, the loss function being defined as a sum over all triplets <u, i, j> of a function derived from the probability of having the item i preferred to the item j by the entity u.
3. The method as claimed in claim 2, wherein the function of the difference between the second relevance score and the weighted first relevance score is a sigmoid function.
4. The method as claimed in claim 1, wherein the loss function is a negative log-likelihood loss function and optimizing the loss function minimizes the loss function over the set of entities and the set of items.
5. The method as claimed in claim 2, wherein the loss function is a negative log-likelihood loss function and optimizing the loss function minimizes the loss function over the set of entities and the set of items.
6. The method as claimed in claim 4, wherein the negative log-likelihood loss function is defined by:
$$L_{adaptive}(\theta,\theta_{g}) = -\sum_{(u,i,j)\in D}\Big(y_{u,i>j}\,\log\sigma\big(f(u,i;\theta)-g(u,i,j;\theta_{g})\,f(u,j;\theta)\big)+\big(1-y_{u,i>j}\big)\log\big(1-\sigma\big(f(u,i;\theta)-g(u,i,j;\theta_{g})\,f(u,j;\theta)\big)\big)\Big),$$
wherein D is the set of triplets <u, i, j>, σ is the sigmoid function, ƒ(u, i; θ) is the learnable scoring function that assigns a relevance score to the item i for the entity u, and $y_{u,i>j}$ is a variable defined by $y_{u,i>j} = 1$ when entity u prefers item i over j, and $y_{u,i>j} = 0$ otherwise.
7. The method as claimed in claim 5, wherein the negative log-likelihood loss function is defined by:
$$L_{adaptive}(\theta,\theta_{g}) = -\sum_{(u,i,j)\in D}\Big(y_{u,i>j}\,\log\sigma\big(f(u,i;\theta)-g(u,i,j;\theta_{g})\,f(u,j;\theta)\big)+\big(1-y_{u,i>j}\big)\log\big(1-\sigma\big(f(u,i;\theta)-g(u,i,j;\theta_{g})\,f(u,j;\theta)\big)\big)\Big),$$
wherein D is the set of triplets <u, i, j>, σ is the sigmoid function, ƒ(u, i; θ) is the learnable scoring function that assigns a relevance score to the item i for the entity u, and $y_{u,i>j}$ is a variable defined by $y_{u,i>j} = 1$ when entity u prefers item i over j, and $y_{u,i>j} = 0$ otherwise.
8. The method as claimed in claim 1, wherein the learnable scoring function is given by ƒ(u, i; θ) = Matching(e_U(x_u; θ₁), e_I(x_i; θ₂)), where e_U is a learnable entity embedding function that maps entity features x_u into an entity embedding $\vec{u}$ = e_U(x_u; θ₁) and e_I is a learnable item embedding function that maps item features x_i into an item embedding $\vec{i}$ = e_I(x_i; θ₂), where θ₁ and θ₂ are learnable parameters, and where the Matching function acts as a similarity measure between the entity embedding and the item embedding that outputs a relevance score for the item corresponding to the item embedding, wherein the relevance score increases with the relevance of the item for the entity corresponding to the entity embedding.
9. The method as claimed in claim 2, wherein the learnable scoring function is given by ƒ(u, i; θ) = Matching(e_U(x_u; θ₁), e_I(x_i; θ₂)), where e_U is a learnable entity embedding function that maps entity features x_u into an entity embedding $\vec{u}$ = e_U(x_u; θ₁) and e_I is a learnable item embedding function that maps item features x_i into an item embedding $\vec{i}$ = e_I(x_i; θ₂), where θ₁ and θ₂ are learnable parameters, and where the Matching function acts as a similarity measure between the entity embedding and the item embedding that outputs a relevance score for the item corresponding to the item embedding, wherein the relevance score increases with the relevance of the item for the entity corresponding to the entity embedding.
10. The method as claimed in claim 4, wherein the learnable scoring function is given by ƒ(u, i; θ) = Matching(e_U(x_u; θ₁), e_I(x_i; θ₂)), where e_U is a learnable entity embedding function that maps entity features x_u into an entity embedding $\vec{u}$ = e_U(x_u; θ₁) and e_I is a learnable item embedding function that maps item features x_i into an item embedding $\vec{i}$ = e_I(x_i; θ₂), where θ₁ and θ₂ are learnable parameters, and where the Matching function acts as a similarity measure between the entity embedding and the item embedding that outputs a relevance score for the item corresponding to the item embedding, wherein the relevance score increases with the relevance of the item for the entity corresponding to the entity embedding.
11. The method as claimed in claim 5, wherein the learnable scoring function is given by ƒ(u, i; θ) = Matching(e_U(x_u; θ₁), e_I(x_i; θ₂)), where e_U is a learnable entity embedding function that maps entity features x_u into an entity embedding $\vec{u}$ = e_U(x_u; θ₁) and e_I is a learnable item embedding function that maps item features x_i into an item embedding $\vec{i}$ = e_I(x_i; θ₂), where θ₁ and θ₂ are learnable parameters, and where the Matching function acts as a similarity measure between the entity embedding and the item embedding that outputs a relevance score for the item corresponding to the item embedding, wherein the relevance score increases with the relevance of the item for the entity corresponding to the entity embedding.
 12. The method as claimed inclaim 8, wherein the learnable scoring function is given by ƒ(u, i;θ)={right arrow over (u)}^(T).{right arrow over (i)}.
13. The method as claimed in claim 8, wherein the learnable scoring function is given by ƒ(u, i; θ) = $\vec{u}^T W \vec{i}$; W = θ^T θ being a k×k symmetric positive semi-definite matrix; k being the dimension of the embeddings $\vec{u}$ and $\vec{i}$; the first set of learnable parameters θ being a matrix of dimension k′×k; k′ being an integer one or more orders of magnitude lower than k.
14. The method as claimed in claim 8, wherein the learnable scoring function is given by ƒ(u, i; θ) = $\vec{u}^T W \vec{i}$; W = diag(θ) being a k×k diagonal matrix whose diagonal is given by θ and whose other values are zero; k being the dimension of the embeddings $\vec{u}$ and $\vec{i}$; the first set of learnable parameters θ being a vector of dimension k.
15. The method as claimed in claim 8, wherein, under a Collaborative-Filtering setting making use only of entity-item interactions, the only available data for computing $\vec{u}$ and $\vec{i}$ consists of a sparse interaction matrix R encoding a rating or a click action of an entity u for an item i.
16-18. (canceled)
19. The method as claimed in claim 1, wherein the learnable mixing function g is a learnable similarity function.
20. The method as claimed in claim 2, wherein the learnable mixing function g is a learnable similarity function.
21. The method as claimed in claim 4, wherein the learnable mixing function g is a learnable similarity function.
22. The method as claimed in claim 6, wherein the learnable mixing function g is a learnable similarity function.
23. The method as claimed in claim 8, wherein the learnable mixing function g is a learnable similarity function.
24-27. (canceled)
28. The method as claimed in claim 1, wherein the learnable mixing function g is a constant hyper-parameter.
29. The method as claimed in claim 2, wherein the learnable mixing function g is a constant hyper-parameter.
30. The method as claimed in claim 4, wherein the learnable mixing function g is a constant hyper-parameter.
31. The method as claimed in claim 6, wherein the learnable mixing function g is a constant hyper-parameter.
32. The method as claimed in claim 8, wherein the learnable mixing function g is a constant hyper-parameter.
33-36. (canceled)
37. The method as claimed in claim 1, wherein the learnable mixing function is defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, W_g = θ_g^T θ_g is a k×k symmetric positive semi-definite matrix, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, the second set of learnable parameters θ_g is a matrix of dimension k′×k, and k′ is an integer one or more orders of magnitude lower than k.
38. The method as claimed in claim 2, wherein the learnable mixing function is defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, W_g = θ_g^T θ_g is a k×k symmetric positive semi-definite matrix, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, the second set of learnable parameters θ_g is a matrix of dimension k′×k, and k′ is an integer one or more orders of magnitude lower than k.
39. The method as claimed in claim 4, wherein the learnable mixing function is defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, W_g = θ_g^T θ_g is a k×k symmetric positive semi-definite matrix, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, the second set of learnable parameters θ_g is a matrix of dimension k′×k, and k′ is an integer one or more orders of magnitude lower than k.
40. The method as claimed in claim 6, wherein the learnable mixing function is defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, W_g = θ_g^T θ_g is a k×k symmetric positive semi-definite matrix, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, the second set of learnable parameters θ_g is a matrix of dimension k′×k, and k′ is an integer one or more orders of magnitude lower than k.
41. The method as claimed in claim 8, wherein the learnable mixing function is defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, W_g = θ_g^T θ_g is a k×k symmetric positive semi-definite matrix, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, the second set of learnable parameters θ_g is a matrix of dimension k′×k, and k′ is an integer one or more orders of magnitude lower than k.
42-45. (canceled)
46. The method as claimed in claim 1, wherein the learnable mixing function is defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, and W_g = diag(θ_g) is a k×k diagonal matrix whose diagonal is given by θ_g and whose other values are zero, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, and the second set of learnable parameters θ_g is a vector of dimension k.
47. The method as claimed in claim 2, wherein the learnable mixing function is defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, and W_g = diag(θ_g) is a k×k diagonal matrix whose diagonal is given by θ_g and whose other values are zero, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, and the second set of learnable parameters θ_g is a vector of dimension k.
48. The method as claimed in claim 4, wherein the learnable mixing function is defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, and W_g = diag(θ_g) is a k×k diagonal matrix whose diagonal is given by θ_g and whose other values are zero, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, and the second set of learnable parameters θ_g is a vector of dimension k.
49. The method as claimed in claim 6, wherein the learnable mixing function is defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, and W_g = diag(θ_g) is a k×k diagonal matrix whose diagonal is given by θ_g and whose other values are zero, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, and the second set of learnable parameters θ_g is a vector of dimension k.
50. The method as claimed in claim 8, wherein the learnable mixing function is defined by g(u, i, j; θ_g) = σ($\vec{i}^T W_g \vec{j}$), where $\vec{i}$ and $\vec{j}$ are the embeddings of items i and j, respectively, σ is a sigmoid function, and W_g = diag(θ_g) is a k×k diagonal matrix whose diagonal is given by θ_g and whose other values are zero, k is the dimension of the embeddings $\vec{i}$ and $\vec{j}$, and the second set of learnable parameters θ_g is a vector of dimension k.
51-54. (canceled)
55. The method as claimed in claim 8, wherein the entity embedding at time t is given by $\vec{u}_t = \mu_{u_t}^{+} - \beta \odot \mu_{u_t}^{-}$, wherein $\mu_{u_t}^{+}$ is the mean of the embeddings of the items clicked by the entity u up to time t and $\mu_{u_t}^{-}$ is the mean of the embeddings of the items non-clicked by the entity u up to time t over a list of K items recommended to the entity, β is a vector of learnable parameters of dimension k that quantifies to which extent the non-clicked items are useful to characterize entity preferences, and ⊙ is the element-wise product of two vectors.
56-63. (canceled)